Current Issue: April–June | Volume: 2018 | Issue: 2 | Articles: 5
Robustness against background noise is a major research area for speech-related applications such as speech recognition and speaker recognition. One of the many solutions for this problem is to detect speech-dominant regions by using a voice activity detector (VAD). In this paper, a second-order polynomial regression-based algorithm is proposed with a similar function as a VAD for text-independent speaker verification systems. The proposed method aims to separate steady noise/silence regions, steady speech regions, and speech onset/offset regions. The regression is applied independently to each filter band of a mel spectrum, which makes the algorithm fit seamlessly into the conventional extraction process of the mel-frequency cepstral coefficients (MFCCs). The k-means algorithm is also applied to estimate the average noise energy in each band for spectral subtraction. A pseudo-SNR-dependent linear thresholding for the final VAD output decision is introduced based on the k-means energy centers. This thresholding considers the speech presence in each band. Conventional VADs usually neglect the deteriorative effects of additive noise in the speech regions. In contrast, the proposed method decides not only on speech presence but also on whether the frame is dominated by speech or by noise. The performance of the proposed algorithm is compared with a continuous noise tracking method and another VAD method in speaker verification experiments, where five different noise types at five different SNR levels were considered. The proposed algorithm showed superior verification performance with both the conventional GMM-UBM method and the state-of-the-art i-vector method.
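As a rough illustration of the per-band regression idea described above, the following minimal sketch fits a second-order polynomial to a sliding window of one mel band's log energy and flags frames with a large local slope as onset/offset. The function name, window length, and threshold are illustrative assumptions, not values from the paper:

```python
import numpy as np

def band_region_labels(log_band_energy, win=9, slope_thresh=0.05):
    """Label frames of one mel-band log-energy track as steady (0) or
    onset/offset (1) via a sliding 2nd-order polynomial regression.
    `win` and `slope_thresh` are hypothetical illustration values."""
    n = len(log_band_energy)
    half = win // 2
    t = np.arange(win) - half            # centered time axis for the fit
    labels = np.zeros(n, dtype=int)
    for i in range(half, n - half):
        seg = log_band_energy[i - half:i + half + 1]
        coeffs = np.polyfit(t, seg, 2)   # seg ~ a*t^2 + b*t + c
        b = coeffs[1]                    # derivative at window center (t=0)
        if abs(b) > slope_thresh:
            labels[i] = 1                # energy rising/falling: onset/offset
    return labels
```

Running this independently over every band of the mel spectrogram, as the abstract describes, yields a per-band region map that the thresholding stage could then combine into a final VAD decision.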
A novel method for audio time stretching has been developed. In time stretching, the audio signal's duration is expanded, whereas its frequency content remains unchanged. The proposed time-stretching method employs the new concept of fuzzy classification of time-frequency points, or bins, in the spectrogram of the signal. Each time-frequency bin is assigned, using a continuous membership function, to three signal classes: tonalness, noisiness, and transientness. The method does not require the signal to be explicitly decomposed into different components; instead, the phase propagation required for time stretching is handled differently at each time-frequency point according to the fuzzy membership values. The new method is compared with three previous time-stretching methods by means of a listening test. The test results show that the proposed method yields slightly better sound quality for large stretching factors compared to a state-of-the-art algorithm, and practically the same quality as a commercial algorithm. The sound quality of all tested methods depends on the audio signal type. According to this study, the proposed method performs well on music signals consisting of mixed tonal, noisy, and transient components, such as singing, techno music, and a jazz recording containing vocals. It performs less well on music containing only noisy and transient sounds, such as a drum solo. The proposed method is applicable to high-quality time stretching of a wide variety of music signals.
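To make the idea of continuous per-bin memberships concrete, the sketch below derives soft tonalness/transientness scores from time- and frequency-direction median filtering (the mechanism familiar from harmonic/percussive separation) and treats intermediate bins as noisy. This is a stand-in under stated assumptions, not the paper's actual membership functions:

```python
import numpy as np
from scipy.signal import stft, medfilt2d

def fuzzy_memberships(x, fs, nfft=2048, hop=512):
    """Assign each spectrogram bin soft memberships in [0, 1] for
    tonalness, noisiness, and transientness. The median-filter
    construction and the noisiness formula are assumptions used
    for illustration, not the paper's definitions."""
    _, _, X = stft(x, fs, nperseg=nfft, noverlap=nfft - hop)
    S = np.abs(X)                              # magnitude spectrogram (freq x time)
    H = medfilt2d(S, kernel_size=(1, 17))      # smooth along time: tonal structure
    P = medfilt2d(S, kernel_size=(17, 1))      # smooth along freq: transient structure
    eps = 1e-12
    tonal = H / (H + P + eps)                  # continuous membership, not a hard mask
    transient = P / (H + P + eps)
    noisy = 1.0 - np.abs(tonal - transient)    # neither clearly tonal nor transient
    return tonal, noisy, transient
```

A time-stretcher in this spirit would then blend its phase-propagation rule per bin, e.g. rigid phase locking where tonalness dominates and phase randomization where noisiness dominates, without ever splitting the signal into separate component tracks.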
Large vocabulary continuous speech recognition (LVCSR) has naturally been in demand for transcribing daily conversations, while developing spoken text data to train LVCSR is costly and time-consuming. In this paper, we propose a classification-based method to automatically select social media data for constructing a spoken-style language model in LVCSR. Three classification techniques, SVM, CRF, and LSTM, trained on words and parts of speech, are comparatively evaluated to identify the degree of spoken style in each social media sentence. Spoken-style utterances are chosen by incremental greedy selection based on the score of the SVM or CRF classifier, or on the output classified as "spoken" by the LSTM classifier. With the proposed method, just 51.8, 91.6, and 79.9% of the utterances in a Twitter text collection are marked as spoken utterances by the SVM, CRF, and LSTM classifiers, respectively. A baseline language model is then improved by interpolating it with one trained on these selected utterances. The proposed model is evaluated on two Thai LVCSR tasks: social media conversations and a speech-to-speech translation application. Experimental results show that all three classification-based data selection methods clearly help reduce the overall spoken test set perplexities. Regarding the LVCSR word error rate (WER), they achieve 3.38, 3.44, and 3.39% WER reduction, respectively, over the baseline language model, and 1.07, 0.23, and 0.38% WER reduction, respectively, over the conventional perplexity-based text selection approach.
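A minimal sketch of the SVM-score-based selection branch follows, using scikit-learn over toy English sentences. The paper works on Thai text with word plus part-of-speech features, so the data, feature set, and the decision-score threshold here are all illustrative assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy stand-in training data; the paper uses Thai social-media text
# with word + part-of-speech features, which are omitted here.
spoken  = ["gonna grab lunch now", "lol that was so funny"]
written = ["the committee approved the annual budget",
           "results are summarized in table two"]

vec = CountVectorizer(ngram_range=(1, 2))          # word unigrams + bigrams
X = vec.fit_transform(spoken + written)
y = [1] * len(spoken) + [0] * len(written)          # 1 = spoken style

clf = LinearSVC().fit(X, y)

# Greedy selection: rank candidate sentences by the SVM decision
# score and keep the spoken-like ones (positive score, an assumed cutoff).
candidates = ["omg see you there", "the report was filed in march"]
scores = clf.decision_function(vec.transform(candidates))
ranked = sorted(zip(scores, candidates), reverse=True)
selected = [sent for score, sent in ranked if score > 0]
print(selected)
```

The selected sentences would then train a spoken-style language model that is interpolated with the baseline model, as the abstract describes.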
Audio signals are a type of high-dimensional data, and their clustering is critical. However, distance calculation failures, inefficient index trees, and cluster overlaps, which derive from equidistance, redundant attributes, and sparsity, respectively, seriously degrade clustering performance. To solve these problems, an audio-signal clustering algorithm based on the sequential Psim matrix and Tabu Search is proposed. First, the audio signal similarity is calculated with the Psim function, which avoids the equidistance problem. The data is then organized using a sequential Psim matrix, which improves the indexing performance. The initial clusters are then generated with differential truncation and refined using Tabu Search, which eliminates cluster overlap. Finally, the K-Medoids algorithm is used to refine the clusters. This algorithm is compared to the K-Medoids and spectral clustering algorithms on UCI waveform datasets. The experimental results indicate that the proposed algorithm can obtain better Macro-F1 and Micro-F1 values with fewer iterations.
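The Psim similarity, differential truncation, and Tabu Search stages are specific to this paper, so they are not reproduced here. As a generic illustration of only the final K-Medoids refinement step, here is a minimal PAM-style sketch over a precomputed distance matrix, with a random initialization standing in for the paper's Tabu Search output (an assumption):

```python
import numpy as np

def k_medoids(D, k, n_iter=100, seed=0):
    """Minimal K-Medoids refinement over a precomputed distance
    matrix D (n x n). The paper initializes from Tabu Search
    results; a random start is used here for illustration."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    labels = np.argmin(D[:, medoids], axis=1)      # assign to nearest medoid
    for _ in range(n_iter):
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if members.size == 0:
                continue
            # pick the member minimizing total in-cluster distance
            costs = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break                                  # converged
        medoids = new_medoids
        labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels

# Toy usage with random "audio feature" vectors and Euclidean distances.
pts = np.random.default_rng(1).normal(size=(60, 8))
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
medoids, labels = k_medoids(D, k=3)
```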
In speech enhancement, noise power spectral density (PSD) estimation plays a key role in determining appropriate de-noising gains. In this paper, we propose a robust noise PSD estimator for binaural speech enhancement in time-varying noise environments. First, it is shown that the noise PSD can be numerically obtained using an eigenvalue of the input covariance matrix. A simplified estimator is then derived through an approximation process, so that the noise PSD is expressed as a combination of the second eigenvalue of the input covariance matrix, the noise coherence, and the interaural phase difference (IPD) of the input signal. Then, to enhance the accuracy of the noise PSD estimate in time-varying noise environments, an eigenvalue compensation scheme is presented, in which two eigenvalues obtained in noise-dominant regions are combined using a weighting parameter based on the speech presence probability (SPP). Compared with the previous prediction-filter-based approach, the proposed method requires neither causality delays nor explicit estimation of prediction errors. Finally, the proposed noise PSD estimator is applied to a binaural speech enhancement system, and its performance is evaluated through computer simulations. The simulation results show that the proposed noise PSD estimator yields accurate estimates regardless of the direction of the target speech signal, and therefore slightly better quality and intelligibility can be obtained than with conventional algorithms.
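As a sketch of the core eigenvalue idea, the code below tracks a smoothed 2x2 binaural covariance matrix per frequency bin and takes its smaller eigenvalue as a raw noise PSD estimate. The SPP-based eigenvalue compensation, the noise-coherence term, and the IPD term from the paper are omitted, and the smoothing constant is an illustrative assumption:

```python
import numpy as np

def noise_psd_from_eigenvalues(XL, XR, alpha=0.9):
    """Raw noise PSD per bin from the smaller eigenvalue of the
    recursively smoothed 2x2 binaural input covariance matrix.
    XL, XR: complex STFTs of shape (n_frames, n_bins).
    `alpha` (smoothing) is an illustrative value; the paper's
    SPP-based compensation scheme is not included."""
    n_frames, n_bins = XL.shape
    phi = np.zeros((n_bins, 2, 2), dtype=complex)   # running covariance per bin
    noise_psd = np.zeros((n_frames, n_bins))
    for t in range(n_frames):
        x = np.stack([XL[t], XR[t]], axis=1)          # (n_bins, 2)
        inst = x[:, :, None] * x[:, None, :].conj()   # outer products x x^H
        phi = alpha * phi + (1 - alpha) * inst        # recursive smoothing
        lam = np.linalg.eigvalsh(phi)                 # eigenvalues, ascending
        noise_psd[t] = lam[:, 0]                      # smallest ~ noise power
    return noise_psd
```

Intuitively, with one directional speech source plus diffuse noise, the larger eigenvalue captures speech-plus-noise energy while the smaller one approximates the noise floor, which is why the second eigenvalue is a natural starting point for the estimator.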